In [1]:
# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='...', project_access_token='...')
pc = project.project_context

Machine Learning and Model Comparisons on Fashion MNIST Dataset

In the first notebook, Part 1 - Data Exploration, we explored the Fashion-MNIST dataset from the Data Asset Exchange. In this notebook we train three machine learning classifiers that could be used to identify fashion and clothing items, and we compare their performance. Throughout this notebook we use the scikit-learn machine learning library.

0. Prerequisites

Before you run this notebook complete the following steps:

  • Insert a project token
  • Install and import required packages

Insert a project token

When you import this project from the Watson Studio Gallery, a token should be automatically generated and inserted at the top of this notebook as a code cell such as the one below:

# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='YOUR_PROJECT_ID', project_access_token='YOUR_PROJECT_TOKEN')
pc = project.project_context

If you do not see the cell above, follow these steps to enable the notebook to access the dataset from the project's resources:

  • Click on More -> Insert project token in the top-right menu section

  • This should insert a cell at the top of this notebook similar to the example given above.

    If an error is displayed indicating that no project token is defined, follow these instructions.

  • Run the newly inserted cell before proceeding with the notebook execution below

Import required packages

In [2]:
# Define required imports
import pandas as pd
import numpy as np

from sklearn.metrics import precision_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import f1_score
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

from warnings import filterwarnings
filterwarnings('ignore')

1. Prepare the Training Data

We start by reading in the training dataset from fashion-mnist_train.csv.

In [3]:
# Training dataset file name
DATA_PATH = 'fashion-mnist_train.csv'

# Create method to find filepath based on filename
def get_file_handle(fname):
    # Project data path for the raw data file
    data_path = project.get_file(fname)
    data_path.seek(0)
    return data_path

# Use pandas to read the data
data_path = get_file_handle(DATA_PATH)
data = pd.read_csv(data_path).values

# Preview data (label, followed by pixel data)
data
Out[3]:
array([[2, 0, 0, ..., 0, 0, 0],
       [9, 0, 0, ..., 0, 0, 0],
       [6, 0, 0, ..., 0, 0, 0],
       ...,
       [8, 0, 0, ..., 0, 0, 0],
       [8, 0, 0, ..., 0, 0, 0],
       [7, 0, 0, ..., 0, 0, 0]])

Save the pixel data and labels into two arrays.

In [4]:
# Save the pixel data as "pixel"
pixel = data[:, 1:]

# Save the label data as "label"
label = data[:, 0]

We are going to train three Machine Learning algorithms using this data that could be used to identify fashion and clothing items.

Define helper functions

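Define a helper function named calculate_metrics, which calculates accuracy, precision, recall, and F-score. Precision, recall, and F-score are averaged across the classes, weighted by the number of samples in each class.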

display_metrics and display_scores are used throughout the notebook to display metrics.

In [5]:
def calculate_metrics(label, label_predict):
    """
    Calculate accuracy, precision, recall and f-score
    """
    acc_score = accuracy_score(label, label_predict)
    pre_score = precision_score(label, label_predict, average='weighted')
    rec_score = recall_score(label, label_predict, average='weighted')
    f_score = f1_score(label, label_predict, average='weighted')
    return (acc_score, pre_score, rec_score, f_score)

def display_metrics(label, label_predict):
    """
    Calculate and display accuracy, precision, recall and f-score
    """
    scores = calculate_metrics(label, label_predict)
    print("Model Accuracy : {}".format(scores[0]))
    print("Model Precision: {}".format(scores[1]))
    print("Model Recall   : {}".format(scores[2]))
    print("Model F-Score  : {}".format(scores[3]))

    
def display_scores(scores):
    """
    Display scores (e.g. accuracy, precision, etc.) and calculate mean
    and standard deviation
    """
    print("Scores            : {}".format(scores))
    print("Mean              : {}".format(scores.mean()))
    print("Standard deviation: {}".format(scores.std()))
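To make the average='weighted' setting used in calculate_metrics more concrete, here is a minimal sketch on a handful of made-up labels (the toy arrays are illustrative assumptions, not Fashion-MNIST data). It computes the weighted precision by hand and compares it with the value returned by scikit-learn. It also shows that weighted recall always equals accuracy, which is why those two numbers are identical in the outputs throughout this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy labels (illustrative only, not taken from Fashion-MNIST)
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 0, 1, 2, 2, 2, 2, 0])

# Precision for each class separately
per_class_precision = precision_score(y_true, y_pred, average=None)

# Support: the number of true samples in each class
support = np.bincount(y_true)

# Weighted average computed by hand: each class counts in proportion to its support
manual_weighted = np.sum(per_class_precision * support) / support.sum()

# The same quantity as computed by scikit-learn
sklearn_weighted = precision_score(y_true, y_pred, average='weighted')

# Weighted recall reduces to (correct predictions) / (all samples), i.e. accuracy
weighted_recall = recall_score(y_true, y_pred, average='weighted')
accuracy = accuracy_score(y_true, y_pred)

print("Per-class precision:", per_class_precision)
print("Manual weighted    :", manual_weighted)
print("sklearn weighted   :", sklearn_weighted)
print("Weighted recall    :", weighted_recall)
print("Accuracy           :", accuracy)
```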
    

2. Train a Decision Tree Classifier

A decision tree is a supervised machine learning technique that can be used to classify data. A decision tree consists of three components: internal nodes, edges/branches and leaf nodes.

  • Internal nodes test an attribute and produce a Yes/True or No/False answer. For example, a node might determine whether a picture contains sleeves.
  • Edges/branches connect nodes and leaves and reflect the outcome of a test. For example, for the above node the answer Yes would be an edge to another node that might determine whether the sleeves are long.
  • Leaf nodes predict the outcome. For example, a picture is classified as Pullover if the previous test determined that the garment has long sleeves.
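The three components can be inspected on a tiny example. The sketch below is only an illustration: the two binary features (has_sleeves, long_sleeves) and the six labeled rows are hypothetical, hand-made data, whereas the real classifier in this notebook splits on raw pixel values. It assumes scikit-learn 0.21 or later, where export_text is available.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical hand-made features: [has_sleeves, long_sleeves]
X_toy = [[0, 0], [0, 0], [1, 0], [1, 0], [1, 1], [1, 1]]

# Hypothetical labels for the toy rows above
y_toy = ['Dress', 'Dress', 'T-shirt', 'T-shirt', 'Pullover', 'Pullover']

# Fit an unrestricted tree; the toy data is perfectly separable
toy_tree = DecisionTreeClassifier(random_state=42).fit(X_toy, y_toy)

# Print the learned tests (internal nodes), their outcomes (branches)
# and the predicted classes (leaf nodes) as indented text
print(export_text(toy_tree, feature_names=['has_sleeves', 'long_sleeves']))

# A garment with long sleeves follows the branches down to a leaf
toy_prediction = toy_tree.predict([[1, 1]])
print(toy_prediction)
```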

We will use the scikit-learn implementation of the Decision Tree Classifier and configure it to build a decision tree from the pixel and label training data. As hyperparameters we use a combination that performed well in these benchmarks and yields results quickly. We specify an arbitrary random number generator seed of 42 to allow for reproducible results.

In [6]:
# Build an sklearn.tree.DecisionTreeClassifier from the training dataset
decision_tree = DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=10, random_state=42)

# Train the classifier
decision_tree.fit(pixel, label)
Out[6]:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=10,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=42,
            splitter='best')

Test the decision tree classifier using the pixel training data. For illustrative purposes we also display the first 20 predictions and expected results to allow for a quick visual comparison.

In [7]:
# Test classifier using the pixel data
label_predict = decision_tree.predict(pixel)

# Review the first 20 labels and predicted labels
print('Correct labels  : {}'.format(label[:20]))
print('Predicted labels: {}'.format(label_predict[:20]))
Correct labels  : [2 9 6 0 3 4 4 5 4 8 0 8 9 0 2 2 9 3 3 3]
Predicted labels: [2 9 4 0 3 4 4 5 4 8 0 8 9 6 2 2 9 3 0 3]

Looking at this small sample, we can already see that not all predictions were correct. Let's calculate and display model accuracy, precision, recall and F-score for the trained classifier.

In [8]:
# display model performance stats
display_metrics(label, label_predict)
Model Accuracy : 0.8479666666666666
Model Precision: 0.8507829440982037
Model Recall   : 0.8479666666666666
Model F-Score  : 0.848208267661459

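The trained model reaches roughly 85% accuracy, precision, recall and F-score. Keep in mind that these scores were computed on the same data the model was trained on, so they may overstate how well the model performs on images it has not seen.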

Validate model performance using 3-fold Cross Validation

Cross-validation, sometimes called rotation estimation or out-of-sample testing, is a model validation technique for assessing how well the results of a statistical analysis generalize to an independent dataset. Its goal is to test the model's ability to predict on new data that was not used to fit it, in order to flag problems like overfitting or selection bias.

The general procedure for k-fold cross validation is as follows:

  1. Shuffle the dataset randomly.
  2. Split the dataset into k groups
  3. For each unique group:
    1. Take the group as a hold out or test data set
    2. Take the remaining groups as a training data set
    3. Fit a model on the training set and evaluate it on the test set
    4. Retain the evaluation score and discard the model
  4. Summarize the skill of the model using the sample of model evaluation scores

    Importantly, each observation in the data sample is assigned to an individual group and stays in that group for the duration of the procedure. This means that each sample is given the opportunity to be used in the hold out set 1 time and used to train the model k-1 times.

    This approach involves randomly dividing the set of observations into k groups, or folds, of approximately equal size. The first fold is treated as a validation set, and the method is fit on the remaining k − 1 folds.
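The procedure above can be sketched as an explicit loop. This is a minimal illustration under stated assumptions: it runs on synthetic data from make_classification rather than on the Fashion-MNIST pixels, and it uses a plain shuffled KFold. Note that cross_val_score with an integer cv uses stratified, unshuffled folds for classifiers, so its folds differ from the ones produced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: replaces the Fashion-MNIST pixels)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=42)

k = 3
# Steps 1 and 2: shuffle the data and split it into k groups
kfold = KFold(n_splits=k, shuffle=True, random_state=42)

fold_scores = []
held_out_counts = np.zeros(len(X), dtype=int)

# Step 3: each group serves as the hold-out set exactly once
for train_idx, test_idx in kfold.split(X):
    # Fit a fresh model on the remaining k-1 groups
    model = DecisionTreeClassifier(max_depth=10, random_state=42)
    model.fit(X[train_idx], y[train_idx])
    # Evaluate on the held-out group and keep only the score
    fold_scores.append(model.score(X[test_idx], y[test_idx]))
    held_out_counts[test_idx] += 1

# Step 4: summarize the evaluation scores
fold_scores = np.array(fold_scores)
print("Scores            :", fold_scores)
print("Mean              :", fold_scores.mean())
print("Standard deviation:", fold_scores.std())
```

The held_out_counts array records how often each sample landed in a hold-out set; every entry ends up as 1, matching the statement that each observation is held out once and used for training k-1 times.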

In this notebook we perform 3-fold cross validation, so k=3.

In [9]:
# Scaled Features not required for Decision Tree
decision_tree_scores = cross_val_score(decision_tree, pixel, label, cv=3, scoring="accuracy") 
display_scores(decision_tree_scores)

label_predcv = cross_val_predict(decision_tree, pixel, label, cv=3)
decision_tree_cv = calculate_metrics(label,label_predcv)
Scores            : [0.80325 0.80905 0.8016 ]
Mean              : 0.8046333333333333
Standard deviation: 0.0031948743672048177

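The cross-validated accuracy is about 80%, noticeably lower than the 85% measured on the training data, which suggests that the tree overfits somewhat. Precision, recall and F-score computed from the cross-validated predictions are also around 80%. Let's try a different approach and see if we can achieve better prediction performance.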

3. Train a Linear Classifier

In the field of machine learning, the goal of statistical classification is to use an object's characteristics to identify which class (or group) it belongs to. A Linear Classifier achieves this by making a classification decision based on the value of a linear combination of the characteristics. An object's characteristics are also known as feature values and are typically presented to the machine in a vector called a feature vector.

In this notebook, we use the scikit-learn implementation of a Linear SGD Classifier. SGD refers to Stochastic Gradient Descent, which is an iterative algorithm to find the target weights of the linear classifier. The feature vector in this case is a vector of pixel values from the image.
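To illustrate what "a linear combination of the characteristics" means in practice, the sketch below fits an SGDClassifier on synthetic data (an assumption made for brevity; the values of max_iter and tol are also illustrative choices) and then reproduces its predictions by hand from the learned weights. Each class gets one score, computed as the dot product of the feature vector with that class's weight vector plus a bias, and the class with the highest score wins.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in data (assumption: replaces the Fashion-MNIST pixels)
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=42)

# max_iter and tol are illustrative values, not taken from the notebook
clf = SGDClassifier(loss='hinge', penalty='l2', max_iter=1000, tol=1e-3,
                    random_state=42)
clf.fit(X, y)

# One score per class: linear combination of the features plus a bias term
# coef_ has shape (n_classes, n_features), intercept_ has shape (n_classes,)
manual_scores = X @ clf.coef_.T + clf.intercept_

# The predicted class is the one with the highest score
manual_pred = clf.classes_[np.argmax(manual_scores, axis=1)]

print("Weight matrix shape:", clf.coef_.shape)
print("Matches predict()  :", np.array_equal(manual_pred, clf.predict(X)))
```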

A few points to keep in mind when we use this classifier:

  • It requires a number of hyperparameters such as the regularization parameter and the number of iterations.
  • It is sensitive to feature scaling.

We therefore need to scale the features carefully and choose hyperparameters wisely.

Each image in the dataset has 784 features (28x28 pixels flattened into a 784x1 vector) and the value of each pixel ranges from 0 to 255. We use scikit-learn's sklearn.preprocessing.StandardScaler class to standardize each feature so that it has zero mean and unit variance. The scaling formula is x_scaled = (x - x_mean) / x_standard_deviation, which is also known as the z-score in statistical analysis. It expresses how many standard deviations each value lies from the mean of its feature.
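As a quick check of the scaling formula, the following sketch applies it by hand to a small made-up array (the values are arbitrary stand-ins for pixel intensities) and compares the result with StandardScaler. The mean and standard deviation are computed per column, that is, per feature.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values standing in for pixel intensities (rows = images, columns = pixels)
toy_pixels = np.array([[0, 10, 255],
                       [128, 20, 0],
                       [255, 30, 100],
                       [64, 40, 200]], dtype=np.float64)

# Scaling with scikit-learn
toy_scaled = StandardScaler().fit_transform(toy_pixels)

# Scaling by hand: z = (x - mean) / standard deviation, per column
col_mean = toy_pixels.mean(axis=0)
col_std = toy_pixels.std(axis=0)  # population standard deviation (ddof=0)
toy_manual = (toy_pixels - col_mean) / col_std

print("Same result   :", np.allclose(toy_scaled, toy_manual))
print("Column means  :", toy_scaled.mean(axis=0))
print("Column stddevs:", toy_scaled.std(axis=0))
```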

In [10]:
# Create an sklearn.preprocessing.StandardScaler instance
scaler = StandardScaler()

# Map pixels with the Scaler
pixel_scaled = scaler.fit_transform(pixel.astype(np.float64))

Build and train a Linear SGD Classifier from the training dataset using a combination of hyperparameters that performed well in these benchmarks and yields results quickly. We specify an arbitrary random number generator seed of 42 to allow for reproducible results.

In [11]:
# Create an sklearn.linear_model.SGDClassifier
sgd = SGDClassifier(loss='hinge', random_state=42, penalty='l2')

# train the classifier using the labels and the feature-scaled pixel values 
sgd.fit(pixel_scaled, label)
Out[11]:
SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='hinge', max_iter=None,
       n_iter=None, n_iter_no_change=5, n_jobs=None, penalty='l2',
       power_t=0.5, random_state=42, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False)

Test the Linear SGD Classifier using the scaled pixel training data.

In [12]:
# Test classifier using the feature-scaled pixel data
label_predict = sgd.predict(pixel_scaled)

# display model performance stats
display_metrics(label, label_predict)
Model Accuracy : 0.8407166666666667
Model Precision: 0.8397222848443929
Model Recall   : 0.8407166666666667
Model F-Score  : 0.8370616634003265

Validate model performance using 3-fold Cross Validation

In [13]:
sgd_scores = cross_val_score(sgd, pixel_scaled, label, cv=3, scoring="accuracy") 
display_scores(sgd_scores)

label_predcv = cross_val_predict(sgd, pixel_scaled, label, cv=3)
linear_classifier_cv = calculate_metrics(label,label_predcv)
Scores            : [0.8331  0.83045 0.82935]
Mean              : 0.8309666666666665
Standard deviation: 0.0015739193823770365

It appears that the trained Linear SGD Classifier is performing better than the Decision Tree classifier. Let's try one more classifier.

4. Train a Logistic Regression Classifier

In statistics, the logistic model (or logit model) is used to model the probability of a certain class or event existing such as pass/fail, win/lose, alive/dead or healthy/sick. This can be extended to model several classes of events such as determining whether an image contains a cat, dog, lion, etc. Each object being detected in the image would be assigned a probability between 0 and 1. Logistic regression is a supervised classification algorithm.
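The probability interpretation can be seen directly in the two-class case. The sketch below is an illustration on synthetic data (an assumption; max_iter=1000 is also an illustrative choice): it fits a binary LogisticRegression with default multi-class handling and shows that the predicted probability of the positive class is the logistic (sigmoid) function applied to the linear score.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data (assumption: stands in for e.g. pass/fail labels)
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=2, random_state=42)

# max_iter is an illustrative value, not taken from the notebook
clf = LogisticRegression(C=10, penalty='l2', solver='lbfgs', max_iter=1000,
                         random_state=42)
clf.fit(X, y)

# Linear score for each sample (weights . features + bias)
linear_scores = clf.decision_function(X)

# The logistic function maps the score to a probability between 0 and 1
manual_proba = 1.0 / (1.0 + np.exp(-linear_scores))

# Probabilities as reported by scikit-learn: columns are [P(class 0), P(class 1)]
proba = clf.predict_proba(X)

print("First five P(class 1):", proba[:5, 1])
print("Matches sigmoid      :", np.allclose(proba[:, 1], manual_proba))
```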

In this notebook we use the scikit-learn implementation of a Logistic Regression Classifier and apply a hyperparameter combination that performed well in these benchmarks and yields results quickly. We specify an arbitrary random number generator seed of 42 to allow for reproducible results.

In [14]:
# Create an sklearn.linear_model.LogisticRegression classifier
log = LogisticRegression(multi_class="ovr", penalty='l2', solver="lbfgs", C=10, random_state=42)

# train the classifier using the labels and the feature-scaled pixel values 
log.fit(pixel_scaled, label)
Out[14]:
LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr',
          n_jobs=None, penalty='l2', random_state=42, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

Test the Logistic Regression Classifier using the feature-scaled pixel training data.

In [15]:
# predict dataset pixel_scaled using trained model
label_predict = log.predict(pixel_scaled)

# display model performance stats
display_metrics(label, label_predict)
Model Accuracy : 0.8741666666666666
Model Precision: 0.8726902548900409
Model Recall   : 0.8741666666666666
Model F-Score  : 0.8728898461905751

Validate model performance using 3-fold Cross Validation

In [16]:
log_scores = cross_val_score(log, pixel_scaled, label, cv=3, scoring="accuracy") 
display_scores(log_scores)

label_predcv = cross_val_predict(log, pixel_scaled, label, cv=3)
log_regression_cv = calculate_metrics(label,label_predcv)
Scores            : [0.84665 0.84535 0.8439 ]
Mean              : 0.8453
Standard deviation: 0.0011232393630329627

The prediction power of the Logistic Regression Classifier is slightly better than that of the SGD Classifier, comparing their 3-fold cross validation scores.

5. Compare Model Performance

Let's compare the three models' cross validation performance side by side!

In [17]:
model_comparison_df = pd.DataFrame([decision_tree_cv, linear_classifier_cv, log_regression_cv], 
                                   columns =['Accuracy', 'Precision', 'Recall', 'F-Score'], 
                                   index=['decision_tree_cv', 'linear_classifier_cv', 'log_regression_cv'])
model_comparison_df
Out[17]:
                      Accuracy  Precision    Recall   F-Score
decision_tree_cv      0.804633   0.805922  0.804633  0.803913
linear_classifier_cv  0.830967   0.832031  0.830967  0.829820
log_regression_cv     0.845300   0.843181  0.845300  0.843766

In our example, a comparison of accuracy, precision, recall, and F-score indicates that the trained Logistic Regression classifier would yield the best prediction results.

Next steps

  • Close this notebook.
  • Open the Part 3 - DL and Model Evaluations notebook.

Authors

This notebook was created by the Center for Open-Source Data & AI Technologies.

Copyright © 2020 IBM. This notebook and its source code are released under the terms of the MIT License.
